Extend dCDH PSU-level wild bootstrap to cell granularity by igerber · Pull Request #332 · igerber/diff-diff

igerber · 2026-04-19T13:58:42Z

Summary

Lift the last NotImplementedError gate in the dCDH survey contract: n_bootstrap > 0 combined with within-group-varying PSU is now supported via a cell-level Hall-Mammen wild multiplier bootstrap.
A dispatcher in _compute_dcdh_bootstrap detects PSU-within-group-constant regimes (PSU=group auto-inject and strictly-coarser PSU with within-group constancy) and routes them through the legacy group-level path so the bootstrap SE stays bit-identical to the previous release.
Under within-group-varying PSU a group contributing cells to multiple PSUs receives independent multiplier draws per PSU (correct Hall-Mammen wild PSU clustering at cell granularity). Multi-horizon bootstraps draw a single shared (n_bootstrap, n_psu) PSU-level weight matrix per block and broadcast per-horizon via each horizon's cell-to-PSU map, so the sup-t simultaneous confidence band remains a valid joint distribution.

Methodology references (required if estimator / math changes)

Method name(s): ChaisemartinDHaultfoeuille survey multiplier bootstrap (_compute_dcdh_bootstrap). Cell-level wild PSU Hall-Mammen extension; cohort-recentered per-cell IF U_centered_per_period (from PR Add cell-period IF allocator for dCDH survey variance #323).
Paper / source link(s): de Chaisemartin & D'Haultfœuille (2020, AER); (2022, NBER WP 29873, Web Appendix §3.7.3). Hall & Mammen (1996) wild bootstrap. The multiplier bootstrap itself is a library extension (the paper prescribes only the analytical plug-in variance); the cell-level allocator extension is documented as such in REGISTRY.md.
Any intentional deviations from the source (and why): Cell-level wild PSU bootstrap under within-group-varying PSU is a library convention extending the same post-period cell-level attribution shipped in PRs Add cell-period IF allocator for dCDH survey variance #323 and Extend dCDH heterogeneity SE to cell-period allocator #329. R DIDmultiplegtDYN does not support survey designs, so "deviation from R" does not apply.

Validation

Tests added/updated:
- tests/test_survey_dcdh.py: flipped the bootstrap gating test to assert finite SE + CI under within-group-varying PSU at both the overall and per-horizon event-study surfaces. Added TestBootstrapCellPeriod with (a) pinned numeric baseline SE captured from pre-PR-4 code (origin/main at ac181b7f) as the bit-identity regression guard; (b) zero-weight-eligible-group smoke test; (c) three negative-contract tests asserting ValueError when psu_varies but the required per-cell tensor is missing or misshaped.
- tests/test_dcdh_bootstrap_cell_period_coverage.py (new, @pytest.mark.slow): 500-rep MC coverage test validating empirical 95% coverage at both the overall bootstrap CI and the horizon-1 event-study bootstrap CI on a DGP with within-group-varying PSU. Horizon-specific assertion guards the shared-PSU-weight multi-horizon path.
Existing regression guards: test_auto_inject_bit_identical_to_group_level (auto-inject invariance), test_coarser_psu_produces_finite_se (strictly-coarser PSU), and the ATT + heterogeneity slow MC coverage tests all pass unchanged.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Lifts the last NotImplementedError gate in the dCDH survey contract: n_bootstrap > 0 combined with within-group-varying PSU is now supported via a cell-level Hall-Mammen wild multiplier bootstrap. Approach: the bootstrap mixin dispatches on whether PSU varies across cells of a group, inspecting a new (n_eligible, n_periods) PSU tensor (sentinel -1 for zero-weight cells) built alongside the existing group-level map. Under PSU-within-group-constant regimes (including PSU=group auto-inject and strictly-coarser PSU with within-group constancy) the dispatcher routes to the legacy group-level bootstrap so the SE is bit-identical to the previous release. Under within-group-varying PSU each (g, t) cell draws its multiplier via the per-cell PSU map and a group contributing cells to multiple PSUs receives independent multiplier draws per PSU — the correct Hall-Mammen wild PSU clustering at cell granularity. Multi-horizon bootstraps draw a single shared (n_bootstrap, n_psu) PSU-level weight matrix once per block and broadcast per-horizon via each horizon's cell-to-PSU map (via psu_codes_per_cell raveled and sentinel-masked), so the sup-t simultaneous confidence band remains a valid joint distribution across horizons. The cohort-recentered per-horizon tensors U_centered_pp_l and U_centered_pp_pl_l are persisted to _mh_pp_cache and _pl_pp_cache dicts inside the existing analytical-SE loops so the bootstrap block can reuse them without recomputing cohort-recentering. Tests: - Flip the bootstrap gating test to assert finite SE + CI under within-group-varying PSU at both the overall and per-horizon event-study surfaces (the multi-horizon path is the highest-risk surface for shared-weight regressions). - Pin a numeric baseline SE captured from pre-PR-4 code (origin/main at SHA ac181b7) under PSU=group and assert bit- identity: test_bootstrap_se_matches_pre_pr4_baseline. Dispatcher regressions that silently drift away from the legacy path fail this guard. - Add zero-weight-eligible-group smoke test. - Add slow-tier MC coverage test (test_dcdh_bootstrap_cell_period_coverage.py) validating empirical 95% coverage at both overall and horizon-1 surfaces on a DGP with within-group-varying PSU. REGISTRY.md Survey + bootstrap contract Note rewritten to describe the cell-level allocator, the dispatcher's bit-identity routing, and the shared-PSU-weight multi-horizon machinery. Cluster contract Note updated; no scope limitation remains in the Survey IF expansion Note. Replicate-weight variance and n_bootstrap > 0 remain mutually exclusive by SurveyDesign construction — no change there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses local AI review P1: the dispatcher previously silently fell back to the legacy group-level bootstrap whenever `psu_varies=True` but a target's per-cell IF tensor was `None`. Under within-group- varying PSU that would silently under-cluster (collapsing multi-PSU group contributions to one PSU). Fix: - Raise ValueError in every target branch (overall, joiners, leavers, multi-horizon, placebo-horizon) when `psu_varies=True` and the target's per-cell tensor is missing. Each raise names the target so the error localizes the missed wiring. - Add shape-check in _unroll_target_to_cells: require `u_per_period_target.shape == psu_codes_per_cell.shape` before flattening, so a silent cell/PSU misalignment fails loudly. - Add three negative-contract tests in TestBootstrapCellPeriod: varying PSU + missing overall tensor, shape mismatch, and varying PSU + missing multi-horizon tensor. - Update stale `_strata_psu_vary_within_group` docstring (the helper no longer gates any out-of-scope combo at fit() time). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-19T14:06:20Z

Overall Assessment

⛔ Blocker

Executive Summary

Affected method: ChaisemartinDHaultfoeuille survey Hall-Mammen wild bootstrap, specifically the new cell-level PSU allocator in _compute_dcdh_bootstrap.
The high-level methodology choice is documented, not a paper/R mismatch: the bootstrap itself and the cell-level PSU extension are explicitly treated as library conventions in diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L1-L39 and docs/methodology/REGISTRY.md:L652-L653.
Two new silent correctness bugs remain in the PSU-map plumbing. Both can change bootstrap SE/CI/p-values with no warning, so they are P0 blockers.
The added zero-weight smoke test is too weak: it only checks “finite” output, so it does not catch the new under-clustering fallback.
I did not run pytest in this container; this is a static review of the diff and surrounding code.

Methodology

Affected method: dCDH survey multiplier bootstrap. The papers cited in the module docstrings justify the analytical plug-in variance; this bootstrap path is a documented library extension, so I am reviewing it against the REGISTRY/bootstrap contract rather than treating the existence of the extension as a defect.

Severity: P0. Impact: In diff_diff/chaisemartin_dhaultfoeuille.py:L2333-L2391, psu_codes_per_cell_bootstrap is only populated inside the valid_map branch. If any variance-eligible group has no positive-weight cells, valid_map flips to False, the entire per-cell tensor is discarded, and _compute_dcdh_bootstrap() falls back to the legacy group-level path because psu_codes_per_cell=None in diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L217-L229. That silently under-clusters bootstrap inference for the remaining positive-weight groups instead of merely dropping the all--1 row, which contradicts the documented sentinel-cell contract in docs/methodology/REGISTRY.md:L653. Concrete fix: always pass psu_codes_per_cell_bootstrap once the tensor is built, even if some rows are all -1; only the legacy group-level map needs per-row validity checks. Then add a regression that the existing zero-weight-group fixture matches the same fit with the zero-weight group removed.
Severity: P0. Impact: The new dense PSU coding is created from all positive-weight rows, not the post-filter eligible target subset, in diff_diff/chaisemartin_dhaultfoeuille.py:L2317-L2331. Because _eligible_group_ids is a filtered subset after the singleton-baseline exclusion in diff_diff/chaisemartin_dhaultfoeuille.py:L5360-L5467, the resulting group_id_to_psu_code_bootstrap can contain gaps. _generate_psu_or_group_weights() then uses max(code)+1 and takes the identity fast path whenever n_psu >= n_groups_target in diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L692-L756. A target map like [1, 1] or [0, 2, 2] will therefore receive independent group draws even though eligible groups still share a PSU. That silently changes bootstrap SEs in the documented “legacy/coarser PSU” regime instead of remaining bit-identical. Concrete fix: densify group_id_to_psu_code_bootstrap on the eligible subset before passing it downstream, or re-factorize group_to_psu_map inside _generate_psu_or_group_weights() before counting/broadcasting.

Code Quality

No separate code-quality findings beyond the two statistical-correctness blockers above.

Performance

No material performance findings. The new Monte Carlo coverage test is marked slow, which is appropriate.

Maintainability

No separate maintainability finding beyond the PSU-code contract bug above; that bug is itself a sign that the dense-code invariant is currently split across too many call sites.

Tech Debt

Severity: P3. Impact: The only nearby dCDH survey item in TODO.md:L59-L60 tracks the broader “cell-period allocator is a library convention” derivation question, not these implementation regressions. Under the review rubric, neither P0 issue is mitigated by existing tracking. Concrete fix: fix the two correctness bugs before merge; only any extra non-blocking follow-up test hardening could be deferred into TODO.md.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new zero-weight regression in tests/test_survey_dcdh.py:L1969-L2003 only asserts that the bootstrap SE stays finite. It therefore passes even when the code silently drops back to unclustered group-level weights. The PR also does not add a regression for the new gapped-code/coarser-PSU failure mode. Concrete fix: strengthen the zero-weight test to compare against the same fit with the zero-weight group removed, and add a coarser-PSU fixture where an ineligible filtered-out group leaves a gap in the eligible PSU codes while the remaining eligible groups still share a PSU.

Path to Approval

Preserve the cell-level bootstrap tensor even when some eligible groups have no positive-weight cells; only exclude those rows at unroll time, not by disabling the entire PSU-aware path.
Re-densify the eligible-subset PSU map before _generate_psu_or_group_weights() decides whether the identity fast path applies.
Add two targeted regressions: one where an all-zero-weight eligible group is equivalent to dropping that group entirely, and one where filtered-out groups create gapped eligible PSU codes but the remaining eligible groups still share a PSU.

Addresses two P0 correctness regressions in the PR-4 bootstrap PSU-map plumbing flagged by CI review. **P0 #1 - valid_map gate discarded the per-cell tensor too eagerly.** When any variance-eligible group had no positive-weight cells (all- sentinel row in psu_codes_per_cell), the old code set valid_map=False and left BOTH group_id_to_psu_code_bootstrap AND psu_codes_per_cell_bootstrap as None. The bootstrap then silently dropped to unclustered group-level instead of excluding only that group's empty row. Fix: always populate psu_codes_per_cell_bootstrap once the tensor is built; the cell-level path already masks out -1 cells at unroll time. Always populate group_id_to_psu_code_bootstrap with a per-group code (use placeholder 0 for all-sentinel rows since those groups have no IF mass and the multiplier they receive is irrelevant on either the legacy or the cell-level path). **P0 #2 - dense PSU codes factorized over non-eligible subset.** `np.unique(obs_psu_codes[pos_mask_boot])` previously included PSU labels from groups that were filtered out of _eligible_group_ids (e.g., singleton-baseline-excluded groups). The excluded groups' PSUs contributed dense codes that formed gaps in the eligible subset's map. Downstream `_generate_psu_or_group_weights` computes `n_psu = max(code) + 1` and triggers the identity fast path when `n_psu >= n_groups_target`. A gapped map like `[1, 1]` or `[0, 2, 2]` silently activated independent-draws clustering for eligible groups that should have shared a multiplier. Fix: restrict the np.unique factorization to the eligible-subset positive-weight obs only (`elig_obs_mask = pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >= 0)`), so the dense code domain exactly matches the PSUs actually used by variance-eligible groups. Tests: - `test_bootstrap_zero_weight_group_equivalent_to_removing_it`: fit with vs without an all-zero-weight eligible group must produce byte-identical bootstrap SE at the same seed (byte- identity would have failed before P0 #1 fix because valid_map flipped the PSU-aware path off for the with-zero-group fit). - `test_bootstrap_dense_codes_under_singleton_baseline_excluded_group`: spies on the group_id_to_psu_code dict passed to `_compute_dcdh_bootstrap` under a fixture with an always-treated singleton-baseline group and strictly-coarser PSU among eligible groups. Asserts the dict's values form a contiguous `[0, n_unique-1]` range (no gaps from the excluded group's PSU), and that eligible groups sharing a PSU label receive the same dense code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T14:13:59Z

/ai-review

github-actions · 2026-04-19T14:22:41Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ee0cc544958eee05769816325c4567d83be3a32a

Overall Assessment

✅ Looks good

Executive Summary

Affected method: ChaisemartinDHaultfoeuille survey Hall-Mammen wild bootstrap, especially the new cell-level PSU dispatcher in _compute_dcdh_bootstrap.
Methodology-wise, the within-group-varying PSU bootstrap remains a documented library extension rather than an undocumented deviation from the dCDH papers or the registry contract.
The two prior blocker findings appear resolved: PSU codes are now densified on the eligible subset, and the bootstrap now hard-fails on missing cell tensors instead of silently degrading to the wrong path.
I did not find a remaining P0/P1 silent-correctness defect in the changed estimator/inference path.
Two non-blocking follow-ups remain: one misleading warning path still assumes one PSU per group, and the docs/tests are not fully updated to the new supported contract.
I could not run pytest here because this environment lacks pytest and numpy; this is a static review.

Methodology

Severity: P3. Impact: The dCDH papers specify the analytical cohort-recentered plug-in variance, not this bootstrap. The new within-group-varying PSU bootstrap is explicitly documented as a library extension in docs/methodology/REGISTRY.md:L653 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L1-L46, and the implementation changes in diff_diff/chaisemartin_dhaultfoeuille.py:L2302-L2415 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L217-L358 match that documented contract. Concrete fix: None.

Code Quality

Severity: P2. Impact: The pre-bootstrap warning prepass still collapses each eligible group to its first positive-weight PSU label in diff_diff/chaisemartin_dhaultfoeuille.py:L2111-L2148. Now that within-group-varying PSU is supported, that can emit a misleading “strictly coarser PSU” warning on supported cell-level designs. Concrete fix: derive the warning from the dense per-cell PSU map or suppress it whenever the design would take the cell-level path.

Performance

No findings.

Maintainability

No findings beyond the stale warning-path assumption above.

Tech Debt

Severity: P3. Impact: The remaining “cell-period allocator is a library convention, not a full observation-level derivation” limitation is already tracked in TODO.md:L60, so it is mitigated under the review rubric and does not block this PR. Concrete fix: None in this PR.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: REGISTRY still says heterogeneity= plus n_bootstrap > 0 with within-group-varying PSU “still raises NotImplementedError” in docs/methodology/REGISTRY.md:L618, but that gate is no longer part of the fit-time contract. Concrete fix: update that note to match the current behavior, or explicitly clarify that heterogeneity inference remains analytical on bootstrapped fits.
Severity: P2. Impact: The new tests do not fully lock down the prior zero-weight varying-PSU fallback bug. The varying-PSU zero-weight test only checks finiteness in tests/test_survey_dcdh.py:L1969-L2003, while the equivalence regression that is meant to catch silent fallback uses psu=group in tests/test_survey_dcdh.py:L2005-L2058, where the buggy and intended paths collapse to the same identity-draw structure. Concrete fix: add an equivalence regression on a genuinely within-group-varying PSU fixture, comparing the zero-weight case to the same fit with that contribution removed.
Verification note: I could not execute the tests in this environment because pytest and numpy are unavailable.

…rying-PSU equivalence test Addresses the three P2 findings from the CI re-review (all P0s cleared). 1. **Warning prepass assumed one PSU per group** (`chaisemartin_dhaultfoeuille.py:2111-2148`). The old code collected `labels[0]` per eligible group, so a within-group-varying PSU design was mis-counted as having one PSU per group and emitted a misleading "strictly-coarser PSU" UserWarning. Rewrite counts unique PSU labels across all positive-weight obs of eligible groups (not just the first label per group); under PSU=group unchanged, under varying-PSU no false warning. 2. **REGISTRY heterogeneity Note still claimed NotImplementedError** (`REGISTRY.md:618`, "Combining heterogeneity= with n_bootstrap > 0 and within-group-varying PSU still raises NotImplementedError"). That gate was removed in the current PR. Update to clarify that heterogeneity inference stays analytical when bootstrap runs on the main ATT surfaces — the two inference paths are independent. 3. **Zero-weight-equivalence test used `psu=group`** (`test_bootstrap_zero_weight_group_equivalent_to_removing_it`). Under PSU=group both the buggy and correct code paths collapse to the same identity-draw structure, so the test didn't actually exercise the P0 #1 regression. Switch the fixture to within-group-varying PSU (period parity per group) so the cell-level dispatcher is invoked and the before-fix silent-dropback bug would fail this test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T14:25:26Z

/ai-review

github-actions · 2026-04-19T14:38:40Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c930f74b2db1927747cc768674f7b9f73204ffbf

Overall Assessment

⚠️ Needs changes

Executive Summary

Previous review items look resolved: the PR now densifies PSU codes on the eligible subset, hard-fails when required per-cell tensors are missing, updates the stale survey-bootstrap warning path, and brings the Registry/CHANGELOG into line with the new contract. diff_diff/chaisemartin_dhaultfoeuille.py:L2111-L2149, diff_diff/chaisemartin_dhaultfoeuille.py:L2330-L2352, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L243-L250, docs/methodology/REGISTRY.md:L592-L653
Affected method: the dCDH survey multiplier bootstrap (_compute_dcdh_bootstrap) on the new within-group-varying-PSU cell-level path. As a bootstrap extension beyond the papers, that overall direction is properly documented in the Methodology Registry and is not itself a defect. docs/methodology/REGISTRY.md:L652-L653
New blocker: on supported terminally-missing survey panels, the newly enabled cell-level bootstrap can silently drop non-zero cohort-centered IF mass on missing (g,t) cells, producing incorrect bootstrap SE/CI/p-values for within-group-varying-PSU fits. diff_diff/chaisemartin_dhaultfoeuille.py:L5161-L5205, diff_diff/chaisemartin_dhaultfoeuille.py:L5233-L5268, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L596-L629
Non-blocking: the new slow coverage test claims to validate the shared multi-horizon PSU-weight path, but it runs with L_max=1, so it never exercises the cross-horizon shared-draw/sup-t machinery it describes. tests/test_dcdh_bootstrap_cell_period_coverage.py:L11-L23, tests/test_dcdh_bootstrap_cell_period_coverage.py:L124-L130
Static review only: I could not run pytest because this environment lacks both pytest and numpy.

Methodology

Severity: P1. Impact: The new cell-level bootstrap path is incorrect on a supported edge case: terminal missingness. Raw per-period IF contributions are zeroed on missing transitions via present, which is how the estimator documents support for terminally-missing groups. But _cohort_recenter_per_period() then subtracts cohort means column-by-column across the full period grid, so a group with no observed cell at period t can still acquire non-zero centered mass in that missing column. The new bootstrap path builds psu_codes_per_cell with -1 sentinels for missing/zero-weight cells and _unroll_target_to_cells() drops those entries entirely, so that centered mass is silently discarded before the bootstrap draw. The old NotImplementedError gate prevented this regime; the PR newly enables it for within-group-varying PSU, so bootstrap inference can now be wrong on supported panels with early exit/right-censoring. Concrete fix: before advertising full support, either (a) reject the cell-level bootstrap whenever non-zero U_centered_per_period lands on a psu_codes_per_cell == -1 cell, or (b) redesign the recentering/unroll contract so centered mass is defined only on observed cells and add a regression for a within-group-varying-PSU panel with terminal missingness. diff_diff/chaisemartin_dhaultfoeuille.py:L5161-L5205, diff_diff/chaisemartin_dhaultfoeuille.py:L5233-L5268, diff_diff/chaisemartin_dhaultfoeuille.py:L2354-L2369, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L236-L254, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L596-L629, docs/methodology/REGISTRY.md:L610-L653

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the methodology blocker above.

Tech Debt

Severity: P3. Impact: The broader “cell-period allocator is a library convention, not a formal observation-level derivation” limitation remains properly documented and tracked, so I did not count that general issue against this PR. Concrete fix: None required here. TODO.md:L51-L60, docs/methodology/REGISTRY.md:L652-L653

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new Monte Carlo test overclaims its coverage. Its docstring says it guards the shared multi-horizon PSU-weight path and the sup-t joint distribution, but the fit uses L_max=1, so there is only one horizon and no cross-horizon sharing to validate. A regression in the newly added shared-weight broadcast logic could still ship undetected. Concrete fix: run the slow MC fixture with L_max >= 2 and assert a genuinely multi-horizon surface, e.g. horizon-2 coverage and/or finite cband_crit_value. tests/test_dcdh_bootstrap_cell_period_coverage.py:L11-L23, tests/test_dcdh_bootstrap_cell_period_coverage.py:L87-L130

Verification note: static review only; pytest and numpy are not installed in this environment.

Path to Approval

Fix the within-group-varying-PSU bootstrap so non-zero recentered IF mass cannot be dropped on sentinel/missing cells, or restore a targeted gate for that case.
Add a regression covering a survey panel with within-group-varying PSU plus terminal missingness (or later zero-weight cells) and assert the corrected behavior or the targeted error.

…entinel cells **P1 (methodology):** Under terminal missingness, `_cohort_recenter_per_period` subtracts cohort column means across the full period grid, so a group with no observation at period t acquires non-zero centered mass at that cell. The PR-4 cell-level bootstrap builds `psu_codes_per_cell` with -1 sentinels for such cells and `_unroll_target_to_cells` drops them — silently losing that centered mass. Under within-group-varying PSU + terminal missingness, this would under-cluster the bootstrap SE/CI/p-values. Conservative guard: `_unroll_target_to_cells` now raises `ValueError` when any sentinel cell (-1 PSU) carries non-zero cohort-recentered IF mass (|u| > 1e-12). The error message points users to `n_bootstrap=0` for analytical TSL on such panels. The analytical path has the same mass-leakage behavior under this regime but was shipped in PR #323; documenting the bootstrap- specific guard here avoids advertising a broken combination. Regression test: `test_bootstrap_cell_level_raises_on_sentinel_mass_leak` constructs a per-cell IF tensor with non-zero mass at a -1-PSU cell and asserts `_compute_dcdh_bootstrap` raises with the documented error message. **P2 (tests):** The slow MC bootstrap coverage test previously ran at `L_max=1`, which collapsed the multi-horizon block to a single target and never exercised the cross-horizon shared-weight path described in its own docstring. Bumped to `L_max=2` so the shared (n_bootstrap, n_psu) PSU-level weight matrix is drawn once and broadcast across horizons via each horizon's cell-to-PSU map. Added three assertions: - Horizon-1 bootstrap CI coverage in [0.925, 0.975]. - Horizon-2 bootstrap CI coverage in [0.910, 0.975]. Tolerance is wider than h-1 because finite-sample analytical TSL coverage on this DGP is itself ~0.93 at l=2 (measured offline: analytical h-1 = 0.94, h-2 = 0.926 at n_groups=40). An observed bootstrap coverage within 1pp of the analytical baseline is consistent with correct clustering; a drop to ≤ 0.90 would indicate a real shared-weight broadcast regression. - `cband_crit_value` finite in ≥ 90% of reps — validates that the shared (n_bootstrap, n_psu) weight matrix produces a coherent joint distribution across horizons (required for a valid sup-t simultaneous band). Bumped n_bootstrap to 1000 (from 500) to keep internal bootstrap MC noise below ~0.3pp per CI endpoint at horizon-2's slightly wider percentile-CI spread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T14:46:22Z

/ai-review

github-actions · 2026-04-19T14:54:08Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9ebb682d61e8a3076b05245827c775458860005f

Overall Assessment

⚠️ Needs changes

Executive Summary

Prior blocker addressed: the new cell-level bootstrap no longer silently drops recentered IF mass on missing/sentinel cells; _unroll_target_to_cells() now fails loudly with a targeted ValueError instead. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:609 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:652
Affected method: the dCDH survey multiplier bootstrap in _compute_dcdh_bootstrap, including the new within-group-varying-PSU cell-level Hall-Mammen path. The papers cited in the module only prescribe analytical plug-in variance; this bootstrap remains a documented library extension, so the allocator choice itself is not a source-material defect. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:5 docs/methodology/REGISTRY.md:663
Remaining P1: the public contract now overclaims support. REGISTRY.md, CHANGELOG.md, and the fit() docstring say within-group-varying PSU bootstrap is supported and the last survey bootstrap gate is gone, but terminally-missing panels are still documented as retained and the new implementation now hard-fails on that regime. docs/methodology/REGISTRY.md:621 docs/methodology/REGISTRY.md:664 CHANGELOG.md:13 diff_diff/chaisemartin_dhaultfoeuille.py:661
Previous P2 test-gap is fixed: the new slow coverage test now uses L_max=2, checks horizon 2, and asserts finite sup-t critical values, so it actually exercises the shared multi-horizon draw path. tests/test_dcdh_bootstrap_cell_period_coverage.py:92
Static review only: pytest and numpy are not installed in this environment.

Methodology

Severity: P1. Impact: The implementation correctly replaces the previous silent-underclustering bug with an explicit failure, but the PR’s documented contract is now inconsistent with the code. The registry still says terminal missingness is retained, while the new bootstrap helper raises on the terminal-missingness/sentinel-mass case; at the same time, the changelog and fit() docs claim within-group-varying PSU bootstrap support is now generally available. That is an undocumented support carve-out in a variance/inference path. Concrete fix: either implement a bootstrap-compatible treatment of recentered mass on missing cells, or document this limitation everywhere the PR currently claims full support and explicitly direct users to n_bootstrap=0 for that regime. docs/methodology/REGISTRY.md:621 docs/methodology/REGISTRY.md:664 CHANGELOG.md:13 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:609
Severity: P3. Impact: The new cell-level wild PSU bootstrap is a documented library extension beyond the dCDH papers’ analytical variance formula, so I did not count the extension itself as a methodology defect. Concrete fix: None. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:5 docs/methodology/REGISTRY.md:663

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: The broader “survey cell-period allocator is a library convention, not a full observation-level derivation” limitation remains tracked in TODO.md and documented in REGISTRY.md, so it does not count against this PR. Concrete fix: None required in this PR. TODO.md:60 docs/methodology/REGISTRY.md:663

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new tests cover the happy-path varying-PSU bootstrap and the helper-level sentinel-mass guard, but there is still no end-to-end fit() regression for within-group-varying PSU plus terminal missingness. The user-facing failure mode is therefore unpinned even though the public contract changed. Concrete fix: add a fit(..., survey_design=sd, n_bootstrap>0) regression on a panel with later-period missing cells and varying PSU, asserting either the targeted ValueError or finite corrected bootstrap output. tests/test_survey_dcdh.py:1176 tests/test_survey_dcdh.py:1938 tests/test_dcdh_bootstrap_cell_period_coverage.py:92

Path to Approval

Make the public contract consistent with the implementation: either support terminal-missing panels on the new cell-level bootstrap path, or document the limitation in REGISTRY.md, CHANGELOG.md, and the fit()/test docstrings that currently claim full support.
Add an end-to-end regression for fit(..., survey_design=sd, n_bootstrap>0) on a within-group-varying-PSU panel with terminal missingness, asserting the chosen contract.

Verification note: static review only; pytest and numpy are unavailable in this environment.

…ession Addresses the CI round-3 P1 (docs overclaiming support relative to the newly-added `_unroll_target_to_cells` guard) and P2 (no end-to-end fit() regression for terminal missingness + varying PSU + bootstrap). **P1 (docs contract consistency):** The cell-level bootstrap now hard-fails on the terminal-missingness mass-leak regime, but the surrounding docs still advertised full support. Added a "Bootstrap + terminal-missingness scope note" to: - `REGISTRY.md` "terminal missingness retained" Note — describes the cohort-recentering leakage mechanism and directs users to `n_bootstrap=0` for affected panels. - `REGISTRY.md` Survey + bootstrap contract Note — same carve-out, also clarifies that PSU-within-group-constant regimes are unaffected (dispatcher routes to the legacy path). - `CHANGELOG.md` (PR-4 entry) — explicit scope note after the "closes the last NotImplementedError gate" claim. - `fit()` docstring `survey_design` paragraph — scope note directs users to `n_bootstrap=0` as the documented workaround. **P2 (end-to-end fit() regression):** Added `test_bootstrap_fit_raises_on_terminal_missingness_with_varying_psu` in TestBootstrapCellPeriod. Fixture: 10 groups with joiners cohort (at period 3), leavers cohort (at period 4), and never-treated controls; group 2 is terminally missing at periods 4-5. At period 4 the other joiners serve as stable_1 controls for the leavers, producing non-zero cohort mean in cohort A — `_cohort_recenter_per_period` leaks `-col_mean` onto group 2's missing cell. Varying PSU (period parity per group) routes the bootstrap to the cell-level path. Test asserts: - `fit(..., n_bootstrap=50)` raises `ValueError` with the documented "no positive-weight observations" message. - `fit(..., n_bootstrap=0)` succeeds on the same panel — analytical TSL supports this regime (the contract the scope note preserves). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T14:59:45Z

/ai-review

github-actions · 2026-04-19T15:05:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: def65032e3086d1eaa36e073618ea41b577fbc4d

Overall Assessment

⛔ Blocker

Executive Summary

The prior P1 about the new bootstrap contract being undocumented is resolved: the changelog, fit() docstring, and Methodology Registry now all explicitly carve out terminal-missingness on the new cell-level bootstrap path instead of claiming unconditional support. CHANGELOG.md:L13-L13 diff_diff/chaisemartin_dhaultfoeuille.py:L661-L679 docs/methodology/REGISTRY.md:L621-L664
The prior P2 test gap is also resolved: the PR now includes an end-to-end terminal-missingness regression and a slow multi-horizon coverage test that actually exercises the shared PSU-weight bootstrap path. tests/test_survey_dcdh.py:L2105-L2180 tests/test_dcdh_bootstrap_cell_period_coverage.py:L1-L244
Severity: P0 [Newly identified]. The PR now tells users to fall back to n_bootstrap=0 on terminal-missing + within-group-varying PSU panels, but the shared analytical survey IF expansion still has no guard for the same leaked missing-cell mass that the bootstrap now rejects. That means the analytical TSL path can silently drop cohort-centered IF mass and return finite SEs on an unsupported regime. diff_diff/chaisemartin_dhaultfoeuille.py:L5240-L5274 diff_diff/chaisemartin_dhaultfoeuille.py:L5762-L5929 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L596-L665 tests/test_survey_dcdh.py:L2105-L2180
The cell-level Hall-Mammen bootstrap itself remains a documented library extension beyond the dCDH papers’ analytical variance formula, so I did not count that design choice as a source-material defect. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L5-L49 docs/methodology/REGISTRY.md:L663-L664

Methodology

Severity: P0 [Newly identified]. Impact: The affected methods are the dCDH survey bootstrap (_compute_dcdh_bootstrap) and the shared analytical survey variance expansion (_survey_se_from_group_if). The bootstrap path now correctly fails when _cohort_recenter_per_period puts nonzero mass on a missing (g,t) cell, because _unroll_target_to_cells() detects that there is no PSU/observation to receive that mass. The analytical TSL path still expands the same U_centered_per_period tensor to observations via psi_i = U[g_i,t_i] * (w_i / W_{g_i,t_i}) with no analogous guard, so missing-cell mass is silently omitted from psi and can understate SEs with no warning. The PR’s new docs and regression then direct users to exactly that unsupported fallback (n_bootstrap=0) on the offending panel class. Concrete fix: either implement a principled analytical treatment for missing-cell cohort-centered mass, or add the same failure/NaN guard to _survey_se_from_group_if / _compute_se and stop advertising n_bootstrap=0 as a valid workaround until that is done. diff_diff/chaisemartin_dhaultfoeuille.py:L5240-L5274 diff_diff/chaisemartin_dhaultfoeuille.py:L5762-L5929 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L596-L665 diff_diff/chaisemartin_dhaultfoeuille.py:L669-L675 docs/methodology/REGISTRY.md:L621-L664 tests/test_survey_dcdh.py:L2105-L2180
Severity: P3. Impact: The new cell-level wild-PSU bootstrap is explicitly documented as a library extension rather than something prescribed by de Chaisemartin–D’Haultfœuille or by an R reference implementation, so the allocator choice itself is not a methodology defect. Concrete fix: None. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L19-L39 docs/methodology/REGISTRY.md:L663-L664

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The existing survey cell-period allocator TODO does not mitigate the P0 above, because silent variance/SE correctness issues are never deferrable. TODO.md:L60-L60

Security

No findings.

Documentation/Tests

Severity: P2 [Newly identified]. Impact: The new user-facing text and regression now codify the unsupported analytical fallback rather than guarding it. CHANGELOG.md, the fit() docstring, and test_bootstrap_fit_raises_on_terminal_missingness_with_varying_psu all tell users to expect finite analytical output on the same regime where the shared survey IF expansion loses mass. Concrete fix: once the analytical path is either fixed or guarded, update those docs and replace the finite-SE assertion with a regression for the chosen contract (ValueError or all-NaN inference plus warning). CHANGELOG.md:L13-L13 diff_diff/chaisemartin_dhaultfoeuille.py:L669-L675 tests/test_survey_dcdh.py:L2105-L2180

Path to Approval

Add an analytical-path guard or a mathematically valid fix for missing-cell cohort-centered mass under terminal missingness + within-group-varying PSU in _survey_se_from_group_if / _compute_se.
Remove the current n_bootstrap=0 workaround language from the changelog, registry, and fit() docstring unless/until that analytical path is actually valid on this regime.
Replace the new analytical fallback regression with one that asserts the final guarded behavior on the same fixture.

Static review only; numpy/pytest are unavailable in this environment, so I could not execute the new tests.

Addresses the P0 escalation: recommending `n_bootstrap=0` as a workaround for terminal missingness + within-group-varying PSU was incorrect — the analytical TSL path uses the SAME cell-period allocator and has the same silent mass-drop bug. **Analytical guard parity.** `_survey_se_from_group_if` now computes `W_cell` (per-(g,t) weight totals) and, before the cell-to-obs expansion, checks whether any cell has `W_cell == 0` while the corresponding cohort-recentered IF mass is non-zero. If so, raises a targeted `ValueError` mirroring the bootstrap-side `_unroll_target_to_cells` guard. The message text is aligned across the two paths (same "no positive-weight observations" phrasing) so the regression test matches both. **Docs cleanup.** Removed the "use n_bootstrap=0 as workaround" language from REGISTRY, CHANGELOG, and the fit() docstring. Replaced with the correct workaround: pre-process the panel (drop late-exit groups / trim to a balanced sub-panel), or use an explicit `psu=<group_col>` so the dispatcher routes through the legacy group-level path (which does not use the cell-period allocator and is not affected by the mass-leak). **Regression test update.** The end-to-end fit() regression now asserts `ValueError` on BOTH `n_bootstrap=0` and `n_bootstrap > 0` under the terminally-missing + within-group-varying PSU fixture. This is technically a behavior change for panels previously covered silently by PR #323's cell-period analytical allocator — those panels used to produce finite (but silently mass-dropped) SEs and now raise. The change closes a real silent-correctness bug; the analytical path never had a principled treatment for the leaked mass in the first place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T15:10:22Z

/ai-review

github-actions · 2026-04-19T15:25:44Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 262fc6160bd491c51c24726d03dadc412acae548

Overall Assessment
Needs changes

Executive Summary

The prior blocker on silently dropping analytical survey IF mass in the new varying-PSU regime is resolved for that regime: _survey_se_from_group_if() now mirrors the bootstrap sentinel-mass guard, and the new end-to-end regression expects ValueError on both analytical and bootstrap paths. diff_diff/chaisemartin_dhaultfoeuille.py:L5893-L5923 tests/test_survey_dcdh.py:L2105-L2187
Severity: P1 [Newly identified]. The new analytical guard is broader than the documented carve-out. Because every survey analytical surface still threads U_centered_per_period into _compute_se(), and _survey_se_from_group_if() unconditionally switches to the cell allocator whenever that tensor exists, terminal-missing panels with constant psu=group or other within-group-constant PSUs can now hit the same ValueError. That contradicts the PR’s own fit() docstring, Registry note, and changelog, all of which say the carve-out is only for within-group-varying PSU and that explicit psu=<group_col> is the workaround. diff_diff/chaisemartin_dhaultfoeuille.py:L1985-L2056 diff_diff/chaisemartin_dhaultfoeuille.py:L5753-L5923 diff_diff/chaisemartin_dhaultfoeuille.py:L669-L680 docs/methodology/REGISTRY.md:L621-L664 CHANGELOG.md:L13-L13
The cell-level Hall-Mammen bootstrap itself remains a documented library extension beyond the dCDH papers and outside any R parity claim, so I did not count that design choice as a methodology defect. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L19-L39 docs/methodology/REGISTRY.md:L664-L664
The new tests materially improve coverage of the intended varying-PSU path, but they only pin the varying-PSU success/failure contract, not the newly documented constant-PSU terminal-missingness workaround. tests/test_survey_dcdh.py:L1176-L1208 tests/test_survey_dcdh.py:L2105-L2187 tests/test_dcdh_bootstrap_cell_period_coverage.py:L1-L244
One runtime error path still tells users to fall back to n_bootstrap=0, which no longer matches this PR’s analytical guard. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L609-L664

Methodology

Severity: P1 [Newly identified]. Impact: the affected method is the survey-design variance path for ChaisemartinDHaultfoeuille, specifically the analytical TSL branch that now shares the cell-period allocator contract with the new cell-level PSU bootstrap. The code now documents that only terminal missingness plus within-group-varying PSU is unsupported, and that constant-PSU regimes should remain on the legacy group-level analytical sum. But that dispatcher does not exist on the analytical side: fit() always passes the per-cell tensor into _compute_se(), and _survey_se_from_group_if() always enables the cell allocator when that tensor is present, then raises on any leaked mass. The result is a production error on a documented supported regime, not just on the intended varying-PSU carve-out. Concrete fix: branch the analytical SE path on PSU-within-group constancy and use the legacy group-level allocator whenever PSU is constant within group; keep the new leaked-mass ValueError only on the true within-group-varying-PSU cell allocator path. diff_diff/chaisemartin_dhaultfoeuille.py:L1985-L2056 diff_diff/chaisemartin_dhaultfoeuille.py:L5753-L5923 diff_diff/chaisemartin_dhaultfoeuille.py:L669-L680 docs/methodology/REGISTRY.md:L621-L664
Severity: P3. Impact: the cell-level wild PSU bootstrap is explicitly documented as a library extension, not something prescribed by the dCDH papers or by an R reference implementation. Concrete fix: none. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L19-L39 docs/methodology/REGISTRY.md:L664-L664

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new untracked findings. The remaining “cell-period allocator is a library convention, not a formal observation-level derivation” limitation is already tracked in TODO.md and documented in the Registry, so it is mitigated under this repo’s review rules. TODO.md:L60-L60

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests exercise the intended varying-PSU success path and the varying-PSU terminal-missingness failure path, but they do not add the companion regression for the constant-PSU terminal-missingness contract that the docs now advertise. That is why the over-scoped analytical guard above can slip through. Concrete fix: add a survey regression with terminal missingness plus psu="group" and another with a strictly coarser within-group-constant PSU, and assert the analytical fit succeeds with finite SE or matches the legacy group-level analytical path. tests/test_survey_dcdh.py:L1176-L1208 tests/test_survey_dcdh.py:L2105-L2187 docs/methodology/REGISTRY.md:L621-L664
Severity: P3. Impact: _unroll_target_to_cells() still tells users to “use n_bootstrap=0” as the fallback when leaked mass hits a sentinel cell, but this PR’s own analytical guard means that is no longer the right workaround for the varying-PSU carve-out. Concrete fix: update the helper docstring and ValueError text to match the final supported contract. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L609-L664

Path to Approval

Add an analytical dispatcher/fallback so terminal-missing survey panels with PSU constant within group use the legacy group-level allocator instead of the cell allocator.
Add regression tests for terminal missingness under psu="group" and under a strictly coarser within-group-constant PSU, alongside the existing varying-PSU ValueError regression.
Update _unroll_target_to_cells() runtime/docs text so it no longer recommends n_bootstrap=0 as the workaround on the leaked-mass varying-PSU path.

Static review only. numpy, pandas, and pytest are unavailable in this environment, so I could not execute the added tests.

The round-5 analytical guard in _survey_se_from_group_if was over- scoped: it fired on any panel with leaked centered mass, including PSU-within-group-constant regimes (PSU=group auto-inject, strictly- coarser-PSU-within-group-constant). But the docs said those regimes are the intended workaround. Panels with terminal missingness and PSU=group are the documented supported path and must not raise. Fix: add an analytical dispatcher inside `_survey_se_from_group_if` mirroring the bootstrap's `_psu_varies_within_group` routing. When PSU is within-group-constant on the eligible subset, fall back to the legacy group-level allocator (which uses U_centered[g] directly via the row-sum identity and does not trigger the sentinel-mass guard). Only when PSU actually varies within group does the cell allocator run — and only then can the sentinel-mass guard fire. Byte-identity: under PSU=group the row-sum identity makes the cell and group allocators statistically equivalent, but the legacy branch was the one in use before PR #323 introduced the cell allocator. Rerouting to it under within-group-constant regimes is a regression-free fallback (it's what PR #323 aimed to preserve via byte-identity in the first place). Test coverage: added `test_fit_succeeds_on_terminal_missingness_with_psu_group` — same fixture as the varying-PSU failure regression but with auto-inject `psu=<group>`, asserts both `n_bootstrap=0` and `n_bootstrap > 0` succeed with finite SE. Also updated the cell-level bootstrap `ValueError` text (and the _unroll_target_to_cells docstring) to no longer advertise `n_bootstrap=0` as a workaround — both paths now fail consistently on the varying-PSU carve-out, and the correct workaround is pre-processing or using an explicit `psu=<group_col>`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T15:29:37Z

/ai-review

github-actions · 2026-04-19T15:37:41Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 91eb2c7733cd29bc8ebe7b88e560f2b2d80ad166

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior re-review P1 on terminal-missingness overreaching into PSU-within-group-constant regimes appears fixed: the analytical helper now falls back to the legacy group-level path when PSU is constant within group, and the new psu=group regression covers both analytical and bootstrap fits. diff_diff/chaisemartin_dhaultfoeuille.py#L5866 tests/test_survey_dcdh.py#L2189
The stale bootstrap error/docs path is also fixed: _unroll_target_to_cells() now matches the final contract instead of telling users to fall back to n_bootstrap=0. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L609
One new P1 remains: the new allocator fallback in _survey_se_from_group_if() runs before the helper dispatches to replicate variance, so replicate-weight ATT surfaces now silently lose the cell-period allocator whenever resolved.psu is None (which is every replicate design). That changes replicate SEs on supported fits and is not documented in the current registry/helper contract. diff_diff/chaisemartin_dhaultfoeuille.py#L5866 diff_diff/chaisemartin_dhaultfoeuille.py#L5988 docs/methodology/REGISTRY.md#L663
The cell-level Hall-Mammen bootstrap itself is properly documented as a library extension beyond the dCDH papers / R reference, so I did not count that design choice as a defect. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L19 docs/methodology/REGISTRY.md#L664
Test coverage improved materially for varying-PSU bootstrap and the constant-PSU terminal-missingness workaround, but there is still no regression that locks the Class A replicate allocator contract, which is why the P1 above can slip through. tests/test_survey_dcdh.py#L1176 tests/test_survey_dcdh.py#L1808 tests/test_survey_dcdh_replicate_psu.py#L147

Methodology

Severity: P1. Impact: affected methods are the replicate-weight analytical SE paths for the main ChaisemartinDHaultfoeuille ATT surfaces that flow through _survey_se_from_group_if() (DID_M, DID_+, DID_-, DID_l, and dynamic placebos). The new fallback at _survey_se_from_group_if() now disables the cell allocator before the helper dispatches to compute_replicate_if_variance(). Because replicate designs do not carry PSU labels here, that branch forces all replicate fits onto the legacy group-level allocator. Replicate variance is observation-level and allocator-sensitive when replicate ratios vary within group, so this silently changes SEs/inference on supported designs. That conflicts with the current ATT survey IF contract still documented in the registry/helper text. Concrete fix: make the new constant-PSU fallback TSL-only, e.g. skip it when resolved.uses_replicate_variance is true, or move the replicate-vs-TSL dispatch ahead of the fallback so replicate fits retain their intended allocator. If the intended methodology is actually to move Class A replicate sites to the legacy allocator, that needs an explicit REGISTRY/docstring/CHANGELOG update plus dedicated regression coverage before merge. diff_diff/chaisemartin_dhaultfoeuille.py#L5773 diff_diff/chaisemartin_dhaultfoeuille.py#L5866 diff_diff/chaisemartin_dhaultfoeuille.py#L5988 docs/methodology/REGISTRY.md#L663 tests/test_survey_dcdh.py#L1729
Severity: P3. Impact: the new cell-level wild PSU bootstrap is a documented library extension, not a paper-mandated or R-parity behavior, so I am not treating that extension itself as a defect. Concrete fix: none. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L19 docs/methodology/REGISTRY.md#L664

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: the broader “cell-period survey IF expansion is a library convention, not a formal observation-level derivation” limitation remains already tracked and documented; this PR does not make that debt worse. Concrete fix: none required in this PR. TODO.md#L60 docs/methodology/REGISTRY.md#L663

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests cover the intended varying-PSU bootstrap path and the repaired constant-PSU terminal-missingness contract, but they still do not lock the Class A replicate allocator choice. Existing replicate tests mostly assert finiteness / df propagation, and the only allocator-sensitivity regression is for heterogeneity, not the main ATT surfaces. That left the P1 above unguarded. Concrete fix: add a Class A replicate regression that uses within-group-varying replicate ratios and asserts the intended allocator contract for at least overall_se and one horizon/placebo surface; the heterogeneity non-invariance fixture is a good template. tests/test_survey_dcdh.py#L1808 tests/test_survey_dcdh.py#L1729 tests/test_survey_dcdh_replicate_psu.py#L147

Path to Approval

Change _survey_se_from_group_if() so the new constant-PSU fallback only applies on Binder/TSL fits, not on replicate-weight fits.
Add a replicate-weight regression for the Class A ATT surfaces using within-group-varying replicate ratios, and assert the intended SE/inference contract explicitly.
If the allocator change for Class A replicate fits is intentional, update the REGISTRY/helper docs to state that methodology change explicitly before merge.

Static review only: numpy, pandas, and pytest are not installed in this environment, so I could not execute the added tests.

Round-6's constant-PSU fallback in `_survey_se_from_group_if` silently disabled the cell-period allocator for replicate-weight designs, because replicate designs have `resolved.psu is None` and the fallback routes `psu_arr is None` to the legacy group-level path. That's an allocator change on a Class A surface (overall DID_M, joiners, leavers, dynamic placebos) under per-row-varying replicate ratios, where cell and legacy allocators produce materially different Rao-Wu variances (same non-invariance PR #329 established for heterogeneity). Fix: restrict the round-6 fallback to the TSL branch by gating on `not resolved.uses_replicate_variance`. Replicate designs retain the cell-period allocator (the PR #323 Class A contract), and the sentinel-mass guard still fires on mass leakage when it applies. Regression: `TestReplicateClassA::test_att_cell_allocator_with_varying_replicate_ratios` constructs legacy and cell-level psi_obs on the fitted replicate design, feeds both through `compute_replicate_if_variance`, and asserts they produce materially different variances — locking the allocator contract so a future refactor that switches Class A to the legacy allocator would be detectable. Mirrors the heterogeneity non-invariance guard from PR #329's CI review round. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T15:53:57Z

/ai-review

github-actions · 2026-04-19T16:02:06Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c642cee13a7df816c0701b90ab32a295891fea8

Overall Assessment
⚠️ Needs changes.

Executive Summary

The prior re-review P1 on Class A replicate routing appears fixed in code: _survey_se_from_group_if() now skips the new within-group-constant PSU fallback when uses_replicate_variance is true, so replicate ATT surfaces keep the cell allocator. diff_diff/chaisemartin_dhaultfoeuille.py:L5866-L5912
One P1 remains: the new sentinel-mass guard now executes before the replicate/TSL dispatch, so replicate-weight ATT fits inherit the new terminal-missingness ValueError, but the updated public contract still documents that carve-out only for within-group-varying PSU / analytical TSL / bootstrap paths. diff_diff/chaisemartin_dhaultfoeuille.py:L5889-L6004 docs/methodology/REGISTRY.md:L621-L664 CHANGELOG.md:L12-L13
The cell-level Hall-Mammen bootstrap extension itself is documented in the Methodology Registry as a library extension, so I am not treating the extension or the group-vs-cell dispatcher choice as a defect. docs/methodology/REGISTRY.md:L663-L664
Test coverage for varying-PSU bootstrap improved materially, but the new replicate regression still does not actually pin the fitted ATT allocator contract; it only shows two synthetic allocators differ. tests/test_survey_dcdh_replicate_psu.py:L173-L287
Static review only: numpy/pandas are unavailable in this environment, so I could not execute the added tests.

Methodology

Severity: P1. Impact: _survey_se_from_group_if() now raises the new sentinel-mass ValueError before it branches to compute_replicate_if_variance(), and the new fallback explicitly preserves the cell allocator on replicate designs. That means terminally-missing panels can now fail under replicate-weight ATT variance too, but the updated REGISTRY, changelog, and fit() docs still scope the new limitation to within-group-varying PSU / analytical TSL / bootstrap. This is an undocumented contract change on a supported variance path. Concrete fix: either document the replicate-weight limitation in the public contract (REGISTRY.md, CHANGELOG.md, and the fit() docstring) and add a matching regression, or move the guard behind the replicate branch if replicate fits are supposed to retain the prior behavior. diff_diff/chaisemartin_dhaultfoeuille.py:L665-L684 diff_diff/chaisemartin_dhaultfoeuille.py:L5889-L6004 docs/methodology/REGISTRY.md:L621-L664 CHANGELOG.md:L12-L13
Severity: P3. Impact: The cell-level wild PSU bootstrap is a documented library extension beyond the dCDH papers / R reference, so the extension itself is not a methodology defect. Concrete fix: none. docs/methodology/REGISTRY.md:L663-L664

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: The broader survey cell-period allocator convention remains explicitly tracked in TODO.md; this PR does not make that already-documented debt worse. Concrete fix: none required in this PR. TODO.md:L51-L60

Security

No findings.

Documentation/Tests

Severity: P2. Impact: test_att_cell_allocator_with_varying_replicate_ratios claims to lock the repaired Class A replicate contract, but it never compares the fitted overall_se**2 to either reconstructed allocator variance; it only proves two synthetic allocator constructions differ. A future refactor could reintroduce the prior replicate-routing bug and this test would still pass. Concrete fix: capture the fitted ATT variance path and assert that the fitted overall_se**2 matches the intended cell allocator and differs from the legacy group allocator on the same fixture. tests/test_survey_dcdh_replicate_psu.py:L173-L287

Path to Approval

Decide the intended terminal-missingness contract for replicate-weight ATT variance, then make code and docs agree. If the new behavior is intentional, update docs/methodology/REGISTRY.md, CHANGELOG.md, and the fit() docstring to say the guard applies to replicate designs too.
Add a replicate-weight regression on a terminally-missing panel that asserts the intended behavior explicitly.

…ss A regression **P1 (docs scope carve-out missing for replicate ATT):** the sentinel-mass `ValueError` fires BEFORE the replicate-vs-TSL dispatch inside `_survey_se_from_group_if`, so replicate-weight ATT fits (which always use the cell-period allocator per the PR #323 Class A contract) also raise on terminal missingness. The docs previously scoped the carve-out to "within-group-varying PSU / analytical TSL / bootstrap" and silently omitted the replicate path. Rewrite the scope note in REGISTRY.md (both the balanced- baseline Note and the Survey + bootstrap contract Note), CHANGELOG, and the fit() docstring to list all three affected paths explicitly: Binder TSL with within-group-varying PSU, Rao-Wu replicate ATT, and the cell-level wild PSU bootstrap. Clarify that the `psu=<group_col>` workaround applies ONLY to Binder TSL — replicate ATT and the varying-PSU bootstrap have no allocator fallback; the only workaround for those paths is pre-processing the panel to remove terminal missingness. **P2 (synthetic allocator test didn't pin fitted SE):** the previous `test_att_cell_allocator_with_varying_replicate_ratios` showed that synthetic cell and legacy allocators differ, but never tied the difference back to the fit's `overall_se`. A refactor switching Class A replicate to the legacy allocator could have slipped through unnoticed. Rewrite using a pytest monkeypatch spy on `compute_replicate_if_variance` to capture the psi_obs the fit actually passes, then reconstruct the legacy-equivalent psi_obs via the row-sum identity on the SAME captured `U_centered` and verify: - fit's `overall_se**2` equals the recomputed variance on the captured cell-allocator psi_obs (sanity / dispatch alignment); - the legacy-reconstructed psi_obs produces a materially different Rao-Wu variance on the same fixture — so a silent Class A allocator regression would flip overall_se and fail this guard. No code behavior changed in round-8 beyond the doc updates — only the test and documentation now lock the contract explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T16:09:33Z

/ai-review

github-actions · 2026-04-19T16:17:59Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 785cb53a4f515cdd087492223adb94b4b4699f8d

Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings.

Executive Summary

The prior P1 is resolved: the new terminal-missingness contract is now documented consistently in the fit() docstring, the Methodology Registry, and the changelog, and the replicate ATT path still reaches the guard because the new Binder fallback explicitly skips replicate designs. chaisemartin_dhaultfoeuille.py:661 REGISTRY.md:621 CHANGELOG.md:13 chaisemartin_dhaultfoeuille.py:5894
The prior replicate-regression concern is materially addressed: the new spy-based test now pins the fitted ATT variance path to the actual psi_obs passed into compute_replicate_if_variance() and checks that the legacy allocator would differ on the same fixture. test_survey_dcdh_replicate_psu.py:173
Methodology-wise, the affected code paths are _compute_dcdh_bootstrap() and _survey_se_from_group_if(). I did not find an undocumented estimator or variance mismatch: the cell-level Hall-Mammen PSU bootstrap is explicitly documented as a library extension, and PSU-within-group-constant cases still route through the legacy path for bit-identity. chaisemartin_dhaultfoeuille_bootstrap.py:91 chaisemartin_dhaultfoeuille_bootstrap.py:217 REGISTRY.md:669
One remaining P2 test gap: there is still no end-to-end replicate-weight regression for the newly documented terminal-missingness ValueError; current terminal-missingness tests cover analytical TSL and bootstrap only. test_survey_dcdh.py:2105 test_survey_dcdh_replicate_psu.py:173
Static review only: this environment does not have numpy, pandas, or pytest, so I could not execute the changed tests.

Methodology

Severity: P3. Impact: The PR changes the dCDH survey bootstrap / survey-variance contract, but the Methodology Registry now explicitly documents both the cell-level wild PSU bootstrap and the terminal-missingness carve-out, and the implementation matches that documented contract. Concrete fix: none. chaisemartin_dhaultfoeuille_bootstrap.py:217 chaisemartin_dhaultfoeuille.py:5871 REGISTRY.md:621 REGISTRY.md:669

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: _compute_dcdh_bootstrap()’s docstring still describes 3-tuples and a purely per-group weight-matrix flow, but the helper now consumes 4-tuples plus cell-level per-period tensors and can bootstrap over unrolled cell-level PSU maps. That is only an internal-docs mismatch, but it makes the new contract harder to maintain safely. Concrete fix: update the helper docstring to describe the 4th tuple element and the cell-level dispatch path. chaisemartin_dhaultfoeuille_bootstrap.py:123

Tech Debt

Severity: P3. Impact: The broader survey cell-period allocator remains an accepted library convention rather than a paper-derived observation-level linearization; that limitation is already tracked in TODO.md and re-documented in the Registry, so it is non-blocking under the project’s review policy. Concrete fix: none required in this PR. TODO.md:60 REGISTRY.md:668

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new replicate test locks allocator choice on a regular panel, and the new terminal-missingness tests cover analytical TSL / bootstrap, but there is still no end-to-end regression asserting the documented replicate ATT failure mode on a terminally missing panel. A future refactor could relax or bypass that replicate error path without a direct test catching it. Concrete fix: add a replicate-weight fixture with a terminally missing group in a surviving cohort and assert that fit() raises the targeted ValueError on the replicate ATT path. chaisemartin_dhaultfoeuille.py:5894 REGISTRY.md:621 test_survey_dcdh.py:2105 test_survey_dcdh_replicate_psu.py:173

…string Round-8 cleared all P0/P1 findings. Addressing the two remaining P2/P3s with concrete fixes. **P2 (tests/docs): replicate ATT terminal-missingness failure mode lacked an end-to-end regression.** The round-8 docs extended the carve-out to replicate ATT, but no fit()-level test asserted it. Added `TestReplicateClassA::test_att_raises_on_terminal_missingness_replicate_path`: fixture has a terminally-missing joiner group in a cohort whose surviving peers serve as stable_1 controls at the missing period (leaking non-zero centered IF mass onto the missing cells). With SDR replicate weights and per-row-varying ratios, the fit must raise the documented `ValueError` from `_survey_se_from_group_if`'s sentinel-mass guard. **P3 (maintainability): `_compute_dcdh_bootstrap` docstring was stale.** The docstring still described 3-tuple inputs and a purely per-group weight-matrix flow, even though PR 4 changed every per-target input to a 4-tuple and added the cell-level dispatch path. Rewrote the Parameters section to enumerate every kwarg with correct shapes, and added explicit entries for the PR-4 additions (multi_horizon_inputs, placebo_horizon_inputs, group_id_to_psu_code, eligible_group_ids, u_per_period_overall, psu_codes_per_cell) including the dispatcher's routing rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T16:40:32Z

/ai-review

github-actions · 2026-04-19T16:53:49Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b3ac313e4b96db9eaa884492761fb2e99030a983

Overall Assessment
✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

The affected methods are the dCDH survey-variance / survey-bootstrap paths: _survey_se_from_group_if() and _compute_dcdh_bootstrap(). Using the Methodology Registry and in-code references, I did not find an undocumented estimator, weighting, or variance/SE deviation in the changed implementation. docs/methodology/REGISTRY.md:L603-L669 diff_diff/chaisemartin_dhaultfoeuille.py:L5772-L6017 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L250-L598
The new cell-level Hall-Mammen PSU bootstrap is explicitly documented as a library extension, and the terminal-missingness rejection is now documented consistently in the Registry and changelog, so those are mitigated P3s rather than defects. docs/methodology/REGISTRY.md:L621-L669 CHANGELOG.md:L12-L13 TODO.md:L59-L60
The prior re-review test gap is addressed: there is now an end-to-end replicate ATT regression for the documented terminal-missingness ValueError, plus new varying-PSU bootstrap regressions and a slow coverage test for the shared-PSU multi-horizon path. tests/test_survey_dcdh_replicate_psu.py:L173-L349 tests/test_survey_dcdh.py:L1176-L1210 tests/test_survey_dcdh.py:L1856-L2315 tests/test_dcdh_bootstrap_cell_period_coverage.py:L91-L244
One remaining P3 maintainability issue: _compute_dcdh_bootstrap()’s opening docstring still reads like a purely per-group W @ u_centered / divisor algorithm even though the varying-PSU branch is now per-cell and PSU-broadcasted. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L123-L153 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L250-L305 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L417-L545
Static review only: this environment does not have pytest, numpy, pandas, or scipy, so I could not execute the changed tests.

Methodology

Severity: P3. Impact: The PR changes the ChaisemartinDHaultfoeuille survey bootstrap / survey IF-expansion contract, but the affected behavior is now explicitly documented as a library extension and the implementation matches that documented contract: analytical Section 3.7.3 variance remains the paper-backed baseline; the cell-level wild PSU bootstrap and the terminal-missingness carve-out are documented deviations/extensions, not undocumented methodology drift. Concrete fix: none. docs/methodology/REGISTRY.md:L603-L669 diff_diff/chaisemartin_dhaultfoeuille.py:L5871-L5975 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L19-L39

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: _compute_dcdh_bootstrap()’s top-level docstring was partially updated, but its opening summary and numbered algorithm still describe only a per-group weight matrix and W @ u_centered / divisor. That no longer matches the real control flow once psu_varies=True, where the helper unrolls per-cell IF tensors and, for event-study, reuses a shared PSU-level draw matrix. This makes future review/refactor work harder than necessary. Concrete fix: rewrite the introductory bullets / numbered steps to describe both dispatch modes explicitly: legacy group-level path vs cell-level PSU path. diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L123-L153 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L250-L305 diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L417-L545

Tech Debt

Severity: P3. Impact: The broader survey cell-period allocator remains a documented library convention rather than a paper-derived observation-level survey linearization; that limitation is already tracked in TODO.md and re-documented in REGISTRY.md, so it is non-blocking under the project’s review policy. Concrete fix: none in this PR. TODO.md:L59-L60 docs/methodology/REGISTRY.md:L668-L669

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The prior P2 regression-test gap from the last review is resolved. The PR now covers the documented replicate ATT terminal-missingness failure mode, end-to-end varying-PSU bootstrap success/failure cases, and a slow Monte Carlo coverage check for the shared-PSU multi-horizon branch. I did not identify a new blocking docs/test gap in the changed files. Concrete fix: none. tests/test_survey_dcdh_replicate_psu.py:L173-L349 tests/test_survey_dcdh.py:L1176-L1210 tests/test_survey_dcdh.py:L2105-L2243 tests/test_dcdh_bootstrap_cell_period_coverage.py:L91-L244
Runtime verification note: pytest, numpy, pandas, and scipy are unavailable here, so this remains a static review.

…ch modes Round-9 cleared every finding except one P3: the bootstrap helper's opening summary still read as a purely per-group W @ u_centered / divisor flow, which no longer matches the cell-level dispatch under within-group-varying PSU. Rewrote the introductory numbered steps to describe both paths explicitly — (a) legacy group-level path under PSU-within-group- constant regimes and (b) cell-level wild PSU path under within- group-varying PSU — and cross-reference `_unroll_target_to_cells`, the terminal-missingness guard, and the shared-PSU-weights multi- horizon machinery. Also re-flagged the row-sum identity that makes the two paths statistically equivalent under PSU=group. No code behavior change. Targeted tests (269) all pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T16:58:56Z

/ai-review

github-actions · 2026-04-19T17:10:50Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ae5838576f471a6ae1481f7460a6c65d69c3045c

Overall Assessment

✅ Looks good. No unmitigated P0 or P1 findings.

Executive Summary

The affected methods are the dCDH survey variance/bootstrap paths: _survey_se_from_group_if(), _compute_dcdh_bootstrap(), _cohort_recenter_per_period(), and the fit-side bootstrap input builder in diff_diff/chaisemartin_dhaultfoeuille.py:L5720, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L91, diff_diff/chaisemartin_dhaultfoeuille.py:L5250, and diff_diff/chaisemartin_dhaultfoeuille.py:L2302.
Against the Methodology Registry, the new cell-level wild PSU bootstrap and the terminal-missingness carve-out are documented library extensions/deviations, not undocumented estimator drift. See docs/methodology/REGISTRY.md:L603, docs/methodology/REGISTRY.md:L621, and docs/methodology/REGISTRY.md:L668.
The prior re-review P3 about _compute_dcdh_bootstrap()’s docstring drift is resolved; the module and method docstrings now describe both the legacy group-level path and the new cell-level PSU path. See diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L1 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L120.
The new regression coverage hits the varying-PSU bootstrap path, the terminal-missingness reject path, and the PSU=group fallback. See tests/test_survey_dcdh.py:L1176, tests/test_survey_dcdh.py:L2105, and tests/test_survey_dcdh_replicate_psu.py:L173.
One remaining P3 is user-facing wording: the shared terminal-missingness ValueError in _survey_se_from_group_if() still describes the failure as an “analytical” SE issue tied to “within-group-varying PSU,” which is misleading on the replicate ATT path. See diff_diff/chaisemartin_dhaultfoeuille.py:L5945, diff_diff/chaisemartin_dhaultfoeuille.py:L6008, and docs/methodology/REGISTRY.md:L621.
Static review only: pytest, numpy, pandas, and scipy are unavailable in this environment.

Methodology

No findings. The changed estimator/math surface matches the documented dCDH survey contract: analytical Section 3.7.3 variance remains the paper-backed baseline, while the cell-level wild PSU bootstrap and terminal-missingness rejection are explicitly documented library extensions/deviations. See docs/methodology/REGISTRY.md:L603, docs/methodology/REGISTRY.md:L621, docs/methodology/REGISTRY.md:L668, diff_diff/chaisemartin_dhaultfoeuille.py:L5720, and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L287.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The previously noted _compute_dcdh_bootstrap() documentation drift appears resolved. See diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L1 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L120.

Tech Debt

Severity: P3-informational. Impact: The broader survey cell-period allocator remains a documented library convention rather than a paper-derived observation-level survey linearization, but this limitation is already tracked in TODO.md and in the Methodology Registry, so it is mitigated under the project policy. Concrete fix: none required in this PR. See TODO.md:L59 and docs/methodology/REGISTRY.md:L668.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The terminal-missingness ValueError raised from _survey_se_from_group_if() says “Analytical survey SE” and attributes the failure to “within-group-varying PSU,” but the same guard is also the documented failure mode for replicate-weight ATT, where no PSU structure is involved. That can misdirect users debugging replicate designs even though the underlying behavior is correct. Concrete fix: branch the message on resolved.uses_replicate_variance, or generalize it to a PSU-agnostic “cell-period allocator survey SE path” message and keep the PSU-specific wording only for the Binder TSL case. See diff_diff/chaisemartin_dhaultfoeuille.py:L5945, diff_diff/chaisemartin_dhaultfoeuille.py:L6008, docs/methodology/REGISTRY.md:L621, and tests/test_survey_dcdh_replicate_psu.py:L173.

The shared terminal-missingness ValueError in `_survey_se_from_group_if` said "Analytical survey SE" and attributed the failure to "within- group-varying PSU" — misleading for replicate ATT users, since replicate designs carry no PSU structure and still hit the same guard. Branch the wording on `resolved.uses_replicate_variance`: - Replicate ATT: "Rao-Wu replicate-weight ATT variance" + note that replicate ATT unconditionally uses the cell-period allocator per the Class A contract, PSU is not involved. - Binder TSL: keeps the original "Analytical Binder TSL survey SE" wording + the within-group-varying-PSU trigger explanation. Shared mechanic and pre-processing workaround kept the same. The regression tests still match on "no positive-weight observations" which appears in both branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T17:13:44Z

/ai-review

github-actions · 2026-04-19T17:26:23Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a520c5281a4f81971e84aa6108b94e22f3369a4f

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

The affected methods are the dCDH survey/bootstrap inference paths: the bootstrap-input builder in [diff_diff/chaisemartin_dhaultfoeuille.py:L2302-L2432](/home/runner/work/diff-diff/diff-diff/diff_diff/chaisemartin_dhaultfoeuille.py#L2302), _compute_dcdh_bootstrap() / _unroll_target_to_cells() in [diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L91-L744](/home/runner/work/diff-diff/diff-diff/diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L91), and _survey_se_from_group_if() in [diff_diff/chaisemartin_dhaultfoeuille.py:L5772-L6009](/home/runner/work/diff-diff/diff-diff/diff_diff/chaisemartin_dhaultfoeuille.py#L5772).
Methodology-wise, I did not find an undocumented estimator or variance drift. The analytical Section 3.7.3 plug-in variance remains intact, while the cell-level wild PSU bootstrap and terminal-missingness rejection are explicitly documented library extensions in [docs/methodology/REGISTRY.md:L603-L669](/home/runner/work/diff-diff/diff-diff/docs/methodology/REGISTRY.md#L603).
The prior re-review P3 about the analytical/replicate terminal-missingness wording is resolved: _survey_se_from_group_if() now distinguishes Binder TSL from replicate-weight ATT in its ValueError text at [diff_diff/chaisemartin_dhaultfoeuille.py:L5961-L5998](/home/runner/work/diff-diff/diff-diff/diff_diff/chaisemartin_dhaultfoeuille.py#L5961).
Remaining issue 1 is P3 only: _unroll_target_to_cells() still suggests psu=<group_col> as a bootstrap workaround for terminal missingness, which contradicts the Registry’s “no allocator fallback” contract for the within-group-varying-PSU bootstrap.
Remaining issue 2 is P3 only: the new bootstrap warning heuristic is broader than the stated “strictly coarser PSU only” contract and can warn on supported varying-PSU designs with shared PSU labels.
Static review only: executable validation was not possible here because pytest, numpy, and pandas are unavailable in this environment.

Methodology

No findings. The changed inference surface matches the documented contract in [docs/methodology/REGISTRY.md:L621-L669](/home/runner/work/diff-diff/diff-diff/docs/methodology/REGISTRY.md#L621): analytical dCDH variance still follows the paper-backed plug-in path, while the within-group-varying-PSU bootstrap and the terminal-missingness guard are explicitly documented library extensions/deviations rather than undocumented methodology changes.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: the new warning block says the Hall-Mammen warning should fire only when PSU is “strictly coarser than group,” but the actual predicate is n_psu_eff_warn < n_groups_eff_warn, which can also fire for supported within-group-varying designs that reuse PSU labels across groups. That makes the code/comment/Registry contract internally inconsistent, even if the warning itself is harmless. See [diff_diff/chaisemartin_dhaultfoeuille.py:L2128-L2168](/home/runner/work/diff-diff/diff-diff/diff_diff/chaisemartin_dhaultfoeuille.py#L2128) and [docs/methodology/REGISTRY.md:L669](/home/runner/work/diff-diff/diff-diff/docs/methodology/REGISTRY.md#L669). Concrete fix: either suppress this warning on the varying-PSU cell-level path, or update the comment/Registry to describe the broader n_psu < n_groups heuristic instead of “strictly coarser PSU only.”

Tech Debt

No new findings. The broader limitation that the survey cell-period allocator is a documented library convention rather than a formal observation-level derivation remains tracked and mitigated in [TODO.md:L51-L60](/home/runner/work/diff-diff/diff-diff/TODO.md#L51) and [docs/methodology/REGISTRY.md:L668-L669](/home/runner/work/diff-diff/diff-diff/docs/methodology/REGISTRY.md#L668).

Security

No findings.

Documentation/Tests

Severity: P3. Impact: _unroll_target_to_cells() still tells users they can work around bootstrap terminal-missingness by switching to psu=<group_col>, but the Registry now says that fallback is only valid for analytical Binder TSL; for the within-group-varying-PSU bootstrap there is no allocator fallback and the panel must be pre-processed. That is a user-facing contract mismatch, not a numerical bug. See [diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L690-L743](/home/runner/work/diff-diff/diff-diff/diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L690) and [docs/methodology/REGISTRY.md:L621-L626](/home/runner/work/diff-diff/diff-diff/docs/methodology/REGISTRY.md#L621). Concrete fix: align the _unroll_target_to_cells() docstring and ValueError text with the Registry: keep the psu=<group_col> workaround only for the analytical Binder TSL path, and direct bootstrap users to pre-process the panel instead.

…= workaround text **P3 #1 (warning predicate inconsistent with "strictly coarser PSU" contract):** the new bootstrap warning block's comment said the warning fires only on strictly-coarser PSU designs, but the predicate `n_psu_eff_warn < n_groups_eff_warn` could also fire on supported varying-PSU designs whose eligible groups happened to share PSU labels across groups. Detect within-group-varying PSU explicitly (`.groupby("g")["p"].nunique().gt(1).any()`) and suppress the warning in that regime. Under auto-inject PSU=group and under within-group-varying PSU the warning now stays silent, matching the stated contract. **P3 #2 (`_unroll_target_to_cells` suggested `psu=<group_col>` as a bootstrap workaround):** the Registry / CHANGELOG already clarified that `psu=<group_col>` is ONLY a Binder TSL workaround; the cell- level wild PSU bootstrap has no allocator fallback. The helper's docstring and `ValueError` message still advertised it as a bootstrap-path workaround. Dropped that suggestion and explicitly clarified: the varying-PSU bootstrap IS the cell-level path, so there is no legacy-allocator alternative to fall back to — pre-processing the panel is the only workaround on the bootstrap side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T17:31:40Z

/ai-review

github-actions · 2026-04-19T17:43:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b67c873ffa430f2db2c162179ea529f857613c75

Overall assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive summary

The affected methods are the dCDH survey/bootstrap inference paths in diff_diff/chaisemartin_dhaultfoeuille.py:2105, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:91, and diff_diff/chaisemartin_dhaultfoeuille.py:5791.
Methodology-wise, I did not find an undocumented estimator, weighting, or variance change. The analytical paper-backed plug-in path remains intact, while the cell-level wild PSU bootstrap and the terminal-missingness rejection are now explicitly documented in docs/methodology/REGISTRY.md:603, docs/methodology/REGISTRY.md:621, docs/methodology/REGISTRY.md:668, and docs/methodology/REGISTRY.md:669.
The two prior P3 items from the last AI re-review appear addressed: the bootstrap warning heuristic now suppresses the varying-PSU regime in diff_diff/chaisemartin_dhaultfoeuille.py:2128, and _unroll_target_to_cells() no longer suggests psu=<group_col> as a bootstrap workaround in diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:666.
Edge-case handling is materially stronger in the changed code: the PR adds explicit guards for missing per-cell tensors, shape mismatches, sentinel-mass leakage, and replicate-path terminal missingness in tests/test_survey_dcdh.py:1804 and tests/test_survey_dcdh_replicate_psu.py:171.
Static review only: executable validation was not possible here because this environment does not have numpy or pytest installed.

Methodology

Severity: None. Impact: No undocumented methodology mismatch found. The changed bootstrap dispatch and analytical fallback behavior align with the updated registry contract in docs/methodology/REGISTRY.md:603, docs/methodology/REGISTRY.md:621, and docs/methodology/REGISTRY.md:669. Concrete fix: None.

Code Quality

Severity: None. Impact: No new inline-inference anti-patterns, NaN/SE inconsistencies, or incomplete propagation issues stood out in the changed files. The new bootstrap inputs are threaded through overall, joiners/leavers, multi-horizon, and placebo-horizon paths in diff_diff/chaisemartin_dhaultfoeuille.py:2202 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:301. Concrete fix: None.

Performance

Severity: None. Impact: I did not identify a performance regression serious enough to warrant changes in this PR. Concrete fix: None.

Maintainability

Severity: None. Impact: The prior internal-contract mismatch around the “strictly coarser PSU” warning appears resolved in diff_diff/chaisemartin_dhaultfoeuille.py:2128, and the bootstrap terminal-missingness messaging is now aligned with the registry in diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:666. Concrete fix: None.

Tech Debt

Severity: P3-informational. Impact: The broader limitation that the survey cell-period allocator is a documented library convention rather than a formal observation-level derivation remains tracked in TODO.md:60 and documented in docs/methodology/REGISTRY.md:668, so it is mitigated and not a blocker for this PR. Concrete fix: None required in this PR.

Security

Severity: None. Impact: No security issues, secrets exposure, or unsafe handling of untrusted PR text were introduced by the changed code. Concrete fix: None.

Documentation/Tests

Severity: None. Impact: The docs and tests were updated in step with the behavior change, including the new survey/bootstrap contract notes in docs/methodology/REGISTRY.md:621 and docs/methodology/REGISTRY.md:669, plus direct regression coverage in tests/test_survey_dcdh.py:1174, tests/test_survey_dcdh.py:1804, tests/test_survey_dcdh_replicate_psu.py:171, and tests/test_dcdh_bootstrap_cell_period_coverage.py:1. Concrete fix: None.

Recaptured the baseline on origin/main@ac181b7f (the stated reference SHA). Current dispatcher reproduces this value bit-for-bit, confirming the PSU-within-group-constant legacy-path routing is intact. The prior pinned value was captured on an intermediate dev checkpoint rather than the merged pre-PR-4 SHA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rust and pure-Python backends take different numeric paths through `_generate_bootstrap_weights_batch` and `solve_ols`, so the pinned pre-PR-4 SE differs between them. Both values were captured on origin/main@ac181b7f under each backend; bit-identity holds within each backend independently. Select the baseline at test time based on the resolved backend (mirrors `_PURE_PYTHON_MODE` detection in conftest.py). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-failures audit Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump. Headline work: - PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner- ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md. - PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator. - PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity. - PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended. Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J. CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement. Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 2 commits April 19, 2026 09:54

igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026

igerber and others added 2 commits April 19, 2026 14:20

igerber merged commit e834a48 into main Apr 19, 2026
19 checks passed

igerber deleted the feature/dcdh-bootstrap-cell-period branch April 19, 2026 20:31

igerber mentioned this pull request Apr 20, 2026

Release 3.2.0: BusinessReport preview, dCDH survey completion, silent-failures audit #342

Merged

Conversation

igerber commented Apr 19, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Overall assessment

Executive summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant